Measuring what Matters: Construct Validity in Large Language Model Benchmarks

Bean, Andrew M., Kearns, Ryan Othniel, Romanou, Angelika, Hafner, Franziska Sofia, Mayne, Harry, Batzner, Jan, Foroutan, Negar, Schmitz, Chris, Korgul, Karolina, Batra, Hunar, Deb, Oishi, Beharry, Emma, Emde, Cornelius, Foster, Thomas, Gausen, Anna, Grandury, María, Han, Simeng, Hofmann, Valentin, Ibrahim, Lujain, Kim, Hazel, Kirk, Hannah Rose, Lin, Fangru, Liu, Gabrielle Kaili-May, Luettgau, Lennart, Magomere, Jabez, Rystrøm, Jonathan, Sotnikova, Anna, Yang, Yushi, Zhao, Yilun, Bibi, Adel, Bosselut, Antoine, Clark, Ronald, Cohan, Arman, Foerster, Jakob, Gal, Yarin, Hale, Scott A., Raji, Inioluwa Deborah, Summerfield, Christopher, Torr, Philip H. S., Ududec, Cozmin, Rocher, Luc, Mahdi, Adam

arXiv.org Artificial Intelligence

Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. Reliably measuring abstract and complex phenomena such as 'safety' and 'robustness' requires strong construct validity, that is, having measures that represent what matters to the phenomenon. With a team of 29 expert reviewers, we conduct a systematic review of 445 LLM benchmarks from leading conferences in natural language processing and machine learning. Across the reviewed articles, we find patterns related to the measured phenomena, tasks, and scoring metrics which undermine the validity of the resulting claims. To address these shortcomings, we provide eight key recommendations and detailed actionable guidance to researchers and practitioners in developing LLM benchmarks.


Multi-Agent Vulcan: An Information-Driven Multi-Agent Path Finding Approach

Olkin, Jake, Parimi, Viraj, Williams, Brian

arXiv.org Artificial Intelligence

Scientists often search for phenomena of interest while exploring new environments. Autonomous vehicles are deployed to explore such areas where human-operated vehicles would be costly or dangerous. Online control of autonomous vehicles for information-gathering is called adaptive sampling and can be framed as a POMDP that uses information gain as its principal objective. While prior work focuses largely on single-agent scenarios, this paper confronts challenges unique to multi-agent adaptive sampling, such as avoiding redundant observations, preventing vehicle collision, and facilitating path planning under limited communication. We start with Multi-Agent Path Finding (MAPF) methods, which address collision avoidance by decomposing the MAPF problem into a series of single-agent path planning problems. We then present information-driven MAPF which addresses multi-agent information gain under limited communication. First, we introduce an admissible heuristic that relaxes mutual information gain to an additive function that can be evaluated as a set of independent single agent path planning problems. Second, we extend our approach to a distributed system that is robust to limited communication. When all agents are in range, the group plans jointly to maximize information. When some agents move out of range, communicating subgroups are formed and the subgroups plan independently. Since redundant observations are less likely when vehicles are far apart, this approach only incurs a small loss in information gain, resulting in an approach that gracefully transitions from full to partial communication. We evaluate our method against other adaptive sampling strategies across various scenarios, including real-world robotic applications. Our method was able to locate up to 200% more unique phenomena in certain scenarios, and each agent located its first unique phenomenon faster by up to 50%.
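
The additive relaxation and collision avoidance described above can be illustrated with a minimal sketch (grid world, one-step greedy planning with per-step cell reservations; the planner structure and names are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]  # 4-connected moves plus waiting

def prioritized_info_plan(starts, info, horizon):
    """Plan one step at a time: each agent, in priority order, moves to the
    neighboring cell with the highest remaining information gain that no
    higher-priority agent has reserved this step. Observing a cell zeros its
    information, so later agents avoid redundant observations; per-step
    reservations prevent vehicle collisions."""
    info = info.astype(float).copy()
    paths = [[s] for s in starts]
    for _ in range(horizon):
        reserved = set()
        for path in paths:
            r, c = path[-1]
            best, best_gain = (r, c), -1.0
            for dr, dc in MOVES:
                nr, nc = r + dr, c + dc
                if (0 <= nr < info.shape[0] and 0 <= nc < info.shape[1]
                        and (nr, nc) not in reserved
                        and info[nr, nc] > best_gain):
                    best, best_gain = (nr, nc), info[nr, nc]
            reserved.add(best)
            info[best] = 0.0   # observation made: no redundant gain for others
            path.append(best)
    return paths
```

Because each agent zeroes the information of the cell it claims, the joint objective decomposes into a sum of per-agent gains, mirroring the additive relaxation of mutual information described in the abstract.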


CauseJudger: Identifying the Cause with LLMs for Abductive Logical Reasoning

He, Jinwei, Lu, Feng

arXiv.org Artificial Intelligence

Large language models (LLMs) have been utilized to solve diverse reasoning tasks, encompassing common-sense, arithmetic, and deduction tasks. However, given the difficulty of reversing thinking patterns and the presence of irrelevant premises, how to determine the authenticity of the cause in abductive logical reasoning remains underexplored. Inspired by the hypothesis-and-verification method and the identification of irrelevant information in human thinking, we propose a new framework for LLM abductive logical reasoning called CauseJudger (CJ), which identifies the authenticity of a possible cause by transforming thinking from reverse to forward and removing irrelevant information. In addition, we construct an abductive logical reasoning dataset for the decision task, called CauseLogics, which contains 200,000 tasks of varying reasoning lengths. Our experiments show the efficiency of CJ through overall experiments, ablation experiments, and case studies on our dataset and a reconstructed public dataset. Notably, CJ's implementation is efficient, requiring only two calls to the LLM. Its impact is profound: when using gpt-3.5, CJ achieves a maximum correctness improvement of 41% compared to Zero-Shot-CoT. Moreover, with gpt-4, CJ attains an accuracy exceeding 90% across all datasets.
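
The two-call structure the abstract describes (reverse-to-forward transformation with irrelevant-premise removal, then verification) can be sketched roughly as follows; the prompt wording and function names are assumptions for illustration, not the paper's:

```python
def cause_judger(premises, rules, candidate_cause, query, llm):
    """Hedged sketch of a two-call abductive reasoning pipeline.
    `llm` is any callable mapping a prompt string to a text response.
    Call 1: restate the abductive question as forward reasoning and
    drop premises irrelevant to the candidate cause.
    Call 2: judge whether the candidate cause entails the observation."""
    prompt1 = (
        "Assume the candidate cause holds: " + candidate_cause + "\n"
        "Keep only the rules and facts needed to reason forward from it.\n"
        "Rules: " + "; ".join(rules) + "\nFacts: " + "; ".join(premises)
    )
    filtered_context = llm(prompt1)
    prompt2 = (
        "Using only this context:\n" + filtered_context +
        "\nDoes the candidate cause '" + candidate_cause +
        "' logically lead to: " + query + "? Answer True or False."
    )
    verdict = llm(prompt2)
    return verdict.strip().lower().startswith("true")
```

The first call performs the reverse-to-forward transformation and filtering; the second performs the forward entailment check, so the whole judgment costs exactly two LLM calls, as the abstract notes.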


Modelling Language

Grindrod, Jumbly

arXiv.org Artificial Intelligence

This paper argues that large language models have a valuable scientific role to play in serving as scientific models of a language. Linguistic study should not only be concerned with the cognitive processes behind linguistic competence, but also with language understood as an external, social entity. Once this is recognized, the value of large language models as scientific models becomes clear. This paper defends this position against a number of arguments to the effect that language models provide no linguistic insight. It also draws upon recent work in philosophy of science to show how large language models could serve as scientific models.


Near-Optimal Active Learning of Multi-Output Gaussian Processes

Zhang, Yehong (National University of Singapore) | Hoang, Trong Nghia (National University of Singapore) | Low, Kian Hsiang (National University of Singapore) | Kankanhalli, Mohan (National University of Singapore)

AAAI Conferences

This paper addresses the problem of active learning of a multi-output Gaussian process (MOGP) model representing multiple types of coexisting correlated environmental phenomena. In contrast to existing works, our active learning problem involves selecting not just the most informative sampling locations to be observed but also the types of measurements at each selected location for minimizing the predictive uncertainty (i.e., posterior joint entropy) of a target phenomenon of interest given a sampling budget. Unfortunately, such an entropy criterion scales poorly in the numbers of candidate sampling locations and selected observations when optimized. To resolve this issue, we first exploit a structure common to sparse MOGP models for deriving a novel active learning criterion. Then, we exploit a relaxed form of submodularity property of our new criterion for devising a polynomial-time approximation algorithm that guarantees a constant-factor approximation of that achieved by the optimal set of selected observations. Empirical evaluation on real-world datasets shows that our proposed approach outperforms existing algorithms for active learning of MOGP and single-output GP models.
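
The greedy, submodularity-style observation selection the abstract alludes to can be sketched for a single-output GP with a plain posterior-variance (entropy) criterion; this stand-in omits the paper's sparse-MOGP structure and measurement-type selection:

```python
import numpy as np

def greedy_entropy_selection(K, budget, noise=1e-6):
    """Greedily select `budget` observations that maximize GP entropy reduction.
    K: (n, n) prior covariance over candidate sampling locations.
    At each round, pick the candidate with the highest posterior variance,
    then condition the GP on it via a rank-1 covariance update."""
    selected = []
    cov = K.astype(float).copy()
    for _ in range(budget):
        var = np.diag(cov).copy()
        var[selected] = -np.inf          # never re-pick an observed location
        j = int(np.argmax(var))          # max variance = max entropy gain
        selected.append(j)
        kj = cov[:, j:j + 1]
        cov = cov - kj @ kj.T / (cov[j, j] + noise)
    return selected
```

Because entropy reduction exhibits a diminishing-returns (submodular-like) structure, this greedy loop is the standard route to constant-factor approximation guarantees of the kind the paper proves for its relaxed criterion.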


Robust Spatio-Temporal Signal Recovery from Noisy Counts in Social Media

Xu, Jun-Ming, Bhargava, Aniruddha, Nowak, Robert, Zhu, Xiaojin

arXiv.org Artificial Intelligence

Many real-world phenomena can be represented by a spatio-temporal signal: where, when, and how much. Social media is a tantalizing data source for those who wish to monitor such signals. Unlike most prior work, we assume that the target phenomenon is known and we are given a method to count its occurrences in social media. However, counting is plagued by sample bias, incomplete data, and, paradoxically, data scarcity -- issues inadequately addressed by prior work. We formulate signal recovery as a Poisson point process estimation problem. We explicitly incorporate human population bias, time delays and spatial distortions, and spatio-temporal regularization into the model to address the noisy count issues. We present an efficient optimization algorithm and discuss its theoretical properties. We show that our model is more accurate than commonly-used baselines. Finally, we present a case study on wildlife roadkill monitoring, where our model produces qualitatively convincing results.
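
A minimal sketch of this kind of penalized Poisson estimation, assuming a 1-D temporal signal, an exposure term standing in for human population bias, and a squared-difference smoothness penalty (the paper's full model also handles spatial distortion and time delays):

```python
import numpy as np

def recover_intensity(counts, exposure, lam=1.0, steps=500, lr=0.05):
    """Recover a latent rate from biased counts by gradient ascent on the
    penalized Poisson log-likelihood. Assumed model (illustrative, not the
    paper's exact formulation): counts_t ~ Poisson(exposure_t * exp(z_t)),
    with penalty -lam * sum_t (z_{t+1} - z_t)^2 enforcing temporal smoothness."""
    z = np.zeros_like(counts, dtype=float)
    for _ in range(steps):
        f = np.exp(z)
        grad = counts - exposure * f          # Poisson log-likelihood gradient
        diff = z[1:] - z[:-1]
        smooth = np.zeros_like(z)             # gradient of the smoothness penalty
        smooth[:-1] += 2 * lam * diff
        smooth[1:] -= 2 * lam * diff
        z += lr * (grad + smooth)
    return np.exp(z)
```

Dividing out the exposure inside the likelihood is what corrects for population bias: periods with many potential reporters no longer look like periods with more of the underlying phenomenon.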


A Skeptic Embrace of Simulation

Funcke, Alexander (Stockholm University)

AAAI Conferences

Skeptics tend not to be the first to jump on the next bandwagon. In quite a few areas of science, simulations and Complex Adaptive Systems (CAS) have been the bandwagon in question. This paper intends to reach out to the skeptics and convince them to hop on, take over the controls, and make the wagon do a U-turn to aim for the established scientific theories. The argument is that simulation techniques, such as Agent-Based Modelling (ABM), may be epistemically problematic when one sets out to strongly corroborate theories concerned with our overly complex real world. However, using the same techniques to explore the robustness of (or to falsify) existing abstract and idealised mathematical models is epistemically uncomplicated. This allows us to study the effects of reintroducing real-world traits, such as autonomy and heterogeneity, that were previously sacrificed for mathematical tractability.